Nobel Prize Winners Analysis¶

Pandas & Seaborn¶

Preliminary Discussion¶

The Nobel Prize is perhaps one of the most widely recognized awards in the world. Besides the honour, the prestige, and the notable prize money that come with winning a Nobel Prize, winners also receive a gold medal showing Alfred Nobel (1833 - 1896), who established the prize. Each Nobel Prize thus consists of a gold medal, a diploma bearing a citation, and a sum of money, the amount of which depends on the income of the Nobel Foundation.

According to Alfred Nobel's will of 1895, the Nobel Prizes are five separate prizes awarded to "those who, during the preceding year, have conferred the greatest benefit to mankind". Alfred Nobel was a Swedish chemist, engineer, and industrialist, most famously known for the invention of dynamite. The five original prizes are given in the categories chemistry, literature, physics, physiology or medicine, and peace; a sixth prize, in economics, was added later and first awarded in 1969. The Nobel Prizes have been awarded since 1901.

A Nobel laureate is a recipient of the Nobel Prize. The award is given annually for outstanding achievement in the fields of physics, chemistry, medicine or physiology, literature, and economics, and for the promotion of peace. It is widely considered one of the most prestigious awards in these fields. At the beginning of October, the Nobel Committee chooses the Nobel Peace Prize laureates through a majority vote. The decision is final and without appeal. The names of the Nobel Peace Prize laureates are then announced. In December, the Nobel Prize laureates receive their prizes.

Setup¶

Here we load the libraries used in this analysis and then read in our data set. The data comes from the Nobel Foundation, which has made available a dataset of all prize winners from the start of the prize, in 1901, to 2016.

# Loading in required libraries
import pandas as pd                 # used for data cleaning and analysis
import seaborn as sns               # for making statistical graphics
import numpy as np                  # for working with ndarray
import matplotlib.pyplot as plt     # for plotting

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reading in the Nobel Prize data
nobel = pd.read_csv('datasets/nobel.csv')

Cleaning and Manipulating¶

Here we shall look at the structure of the data and see if we need to change the variable types or need to create/remove additional variables.

# Structure of Data Set
nobel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   year                  911 non-null    int64 
 1   category              911 non-null    object
 2   prize                 911 non-null    object
 3   motivation            823 non-null    object
 4   prize_share           911 non-null    object
 5   laureate_id           911 non-null    int64 
 6   laureate_type         911 non-null    object
 7   full_name             911 non-null    object
 8   birth_date            883 non-null    object
 9   birth_city            883 non-null    object
 10  birth_country         885 non-null    object
 11  sex                   885 non-null    object
 12  organization_name     665 non-null    object
 13  organization_city     667 non-null    object
 14  organization_country  667 non-null    object
 15  death_date            593 non-null    object
 16  death_city            576 non-null    object
 17  death_country         582 non-null    object
dtypes: int64(2), object(16)
memory usage: 128.2+ KB

We have 18 variables in this data set, including year, category, prize, motivation, and sex. We shall look at the distributions of these variables individually and in pairs, to find any relationships that could exist between them.

Note the following descriptions of some of the variables:

prize_share: share of the prize awarded to the laureate (e.g. 1/1 or 1/2)
laureate_id: unique ID of winner
laureate_type: either an individual or organisation
motivation: reason as to why they were awarded the prize

We note that some of the variables are not stored with the correct data types. For example, we shall change the death_date and birth_date variables to datetime variables.

# convert columns birth_date and death_date to datetime
nobel[["birth_date"]] = nobel[["birth_date"]].apply(pd.to_datetime)
nobel[["death_date"]] = nobel[["death_date"]].apply(pd.to_datetime)
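
As a minimal sketch of what this conversion does, the toy Series below (hypothetical values, not read from nobel.csv) is passed through pd.to_datetime. The errors="coerce" argument is an assumption added here for illustration and is not part of the cell above: it turns missing or unparseable entries into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical sample standing in for a date column that was read in as strings
raw_dates = pd.Series(["1852-08-30", "1839-03-16", None, "not a date"])

# errors="coerce" is an illustrative choice: invalid or missing entries become NaT
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(parsed)
print(parsed.isna().sum())  # 2 (the missing value and the unparseable string)
```

Once a column has a datetime dtype, accessors such as .dt.year become available, which the age calculation later in the analysis relies on.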

# Taking a look at data set
nobel.head(n=6)

Univariate Analysis¶

Here we shall look at the distributions of the variables in our data set.

# Summary of Numeric Variables
nobel.describe().transpose()

Our data set covers 1901 (the first year of the prize) through 2016, as indicated by the minimum and maximum of year in the describe() output. As highlighted earlier, laureate_id is simply a unique identifier for the winners and has no inherent analytical value.

nobel.category.value_counts().plot(kind = 'bar', figsize = (15,5))
plt.xlabel("Category of Prize", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Category", y=1.02);

Most prizes were awarded in the field of Medicine, closely followed by Physics. Economics received the lowest number of prizes over the award's existence; however, it is worth noting that the economics Nobel Prize was only introduced in 1969. The specific counts per category are given below.

# Category Counts
counts_category = nobel.category.value_counts()
counts_category

Medicine      211
Physics       204
Chemistry     175
Peace         130
Literature    113
Economics      78
Name: category, dtype: int64
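
To put these counts in proportion, here is a small sketch that rebuilds the Series from the output above and converts it to percentage shares. The variable name shares is introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts_category = pd.Series(
    {"Medicine": 211, "Physics": 204, "Chemistry": 175,
     "Peace": 130, "Literature": 113, "Economics": 78},
    name="category",
)

# Percentage share of each category among all awarded prizes
shares = (counts_category / counts_category.sum() * 100).round(1)

print(counts_category.sum())  # 911, matching the number of rows in the data set
print(shares)
```

On the full DataFrame, nobel.category.value_counts(normalize=True) should give the same fractions directly.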

Let's look at who the awards were given to - individuals or organisations?

# Laureate Counts
nobel['laureate_type'].value_counts()

Individual      881
Organization     30
Name: laureate_type, dtype: int64

We see that the majority of the awards are given to individuals. Now let's look at the top 10 cities where laureates were born.

# Top 10 Cities where Laureates are from
nobel['birth_city'].value_counts().head(10)

New York, NY      45
Paris             25
London            19
Vienna            14
Chicago, IL       12
Berlin            10
Budapest           8
Brooklyn, NY       8
Boston, MA         8
Washington, DC     7
Name: birth_city, dtype: int64

We also want to know which are the top 10 organisations where laureates are from.

# Top 10 Organisations where Laureates are from
nobel['organization_name'].value_counts().head(10)

University of California                        32
Harvard University                              26
Stanford University                             18
Massachusetts Institute of Technology (MIT)     18
University of Chicago                           17
University of Cambridge                         17
California Institute of Technology (Caltech)    16
Columbia University                             15
Princeton University                            14
Rockefeller University                          11
Name: organization_name, dtype: int64

And finally, it would be nice to know which countries the laureates are from: first the top 5 countries of their organisations, then the top 10 birth countries.

# Top 5 organisation countries of the laureates
nobel['organization_country'].value_counts().head(5)

United States of America       341
United Kingdom                  89
Germany                         43
France                          36
Federal Republic of Germany     23
Name: organization_country, dtype: int64

# Display the number of prizes won by the top 10 nationalities.
nobel['birth_country'].value_counts().head(10)

United States of America    259
United Kingdom               85
Germany                      61
France                       51
Sweden                       29
Japan                        24
Canada                       18
Netherlands                  18
Italy                        17
Russia                       17
Name: birth_country, dtype: int64

Let's try to summarise these findings in one visualisation. See below.

plt.rcParams["figure.figsize"] = (20,20)

# number of prizes won by the top 10 nationalities.
plt.subplot(4, 2, 1)
nobel['birth_country'].value_counts().head(10).plot(kind = 'bar')
plt.xlabel("Birth Country", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Birth Country of Laureates", y=1.02)


# Top 10 organisation countries of the laureates
plt.subplot(4, 2, 2)
nobel['organization_country'].value_counts().head(10).plot(kind = 'bar')
plt.xlabel("Country", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Organisation Country", y=1.02)

# Top 10 Organisations where Laureates are from
plt.subplot(4, 2, 3)
nobel['organization_name'].value_counts().head(10).plot(kind = 'bar')
plt.xlabel("Organisation Name", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Organisation", y=1.02)



# Top 10 Cities where Laureates are from
plt.subplot(4, 2, 4)
nobel['birth_city'].value_counts().head(10).plot(kind = 'bar')
plt.xlabel("Birth City", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Birth City of Laureates", y=1.02)

# Show Plots
plt.tight_layout()
plt.show();
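
The cell above places four plots on a 4 x 2 grid, which leaves the lower half of the figure empty. As a hedged alternative sketch, the same four panels can be drawn on a 2 x 2 grid with plt.subplots and a loop. The toy DataFrame and the panels dictionary below are hypothetical stand-ins so that the snippet runs on its own; with the real data one would use nobel in place of toy.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the nobel DataFrame, only to make the sketch runnable
toy = pd.DataFrame({
    "birth_country": ["USA", "USA", "France", "Germany"],
    "organization_country": ["USA", "USA", "France", "USA"],
    "organization_name": ["Org A", "Org B", "Org A", "Org A"],
    "birth_city": ["New York, NY", "Chicago, IL", "Paris", "Berlin"],
})

# Column to plot -> label used for the x-axis and the title
panels = {
    "birth_country": "Birth Country",
    "organization_country": "Organisation Country",
    "organization_name": "Organisation",
    "birth_city": "Birth City",
}

# One axis per panel on a 2 x 2 grid
fig, axes = plt.subplots(2, 2, figsize=(20, 10))
for ax, (column, label) in zip(axes.flat, panels.items()):
    toy[column].value_counts().head(10).plot(kind="bar", ax=ax)
    ax.set_xlabel(label, labelpad=14)
    ax.set_ylabel("Count", labelpad=14)
    ax.set_title(f"Count of Prizes by {label}", y=1.02)

fig.tight_layout()
```

Passing ax=ax to the pandas plot call keeps each bar chart on its own axis, so no plt.subplot bookkeeping is needed.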

We see a clear USA dominance. This is perhaps not so surprising, as many of the best-known researchers and scientists were from America. So, given that the most common Nobel laureate between 1901 and 2016 was an individual born in the United States of America, was this always the case?

first_half = nobel[["year", "category", "birth_country", "sex", "full_name"]][nobel.year <=1959]
last_half = nobel[["year", "category", "birth_country", "sex", "full_name"]][nobel.year > 1959]

print("\n \n First Half Century Birth Country Counts: \n \n",first_half.birth_country.value_counts().head(10))
print("\n \n last half century birth country counts: \n \n",last_half.birth_country.value_counts().head(10))

 
 First Half Century Birth Country Counts: 
 
 United States of America               56
United Kingdom                         35
France                                 30
Germany                                30
Sweden                                 14
Switzerland                            10
Denmark                                 9
Netherlands                             9
Russia                                  8
Italy                                   7
Scotland                                7
Prussia (Germany)                       6
Belgium                                 6
Austria                                 6
Prussia (Poland)                        5
Spain                                   5
Germany (Poland)                        4
Norway                                  4
Ireland                                 3
Canada                                  3
Austria-Hungary (Czech Republic)        3
China                                   3
Russian Empire (Poland)                 3
India                                   3
Russian Empire (Finland)                2
Austria-Hungary (Hungary)               2
Argentina                               2
Germany (Russia)                        2
Australia                               2
Austrian Empire (Austria)               2
Russian Empire (Ukraine)                2
Schleswig (Germany)                     2
Poland                                  2
Austria-Hungary (Poland)                1
Austria-Hungary (Croatia)               1
Portugal                                1
Austria-Hungary (Austria)               1
Japan                                   1
South Africa                            1
Germany (France)                        1
Iceland                                 1
W&uuml;rttemberg (Germany)              1
Chile                                   1
Russian Empire (Latvia)                 1
Java, Dutch East Indies (Indonesia)     1
Austria-Hungary (Slovenia)              1
Prussia (Russia)                        1
Mecklenburg (Germany)                   1
Hesse-Kassel (Germany)                  1
Luxembourg                              1
East Friesland (Germany)                1
New Zealand                             1
Austrian Empire (Italy)                 1
British India (India)                   1
Bavaria (Germany)                       1
Tuscany (Italy)                         1
Hungary (Slovakia)                      1
Austrian Empire (Czech Republic)        1
Faroe Islands (Denmark)                 1
French Algeria (Algeria)                1
Name: birth_country, dtype: int64

 
 last half century birth country counts: 
 
 United States of America              203
United Kingdom                         50
Germany                                31
Japan                                  23
France                                 21
                                     ... 
Austria-Hungary (Ukraine)               1
Costa Rica                              1
Austria-Hungary (Hungary)               1
Tibet (People's Republic of China)      1
Ukraine                                 1
Name: birth_country, Length: 97, dtype: int64
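
Many birth_country values in the output above combine a historical state with a present-day country in parentheses, such as "Prussia (Germany)". A minimal sketch, assuming one wanted to group such entries under the country in parentheses (the analysis here does not do this), might look as follows. The sample values are copied from the output; the name modern_country is hypothetical.

```python
import pandas as pd

# Sample values copied from the birth_country output above, plus a missing value
birth_country = pd.Series([
    "Germany",
    "Prussia (Germany)",
    "Russian Empire (Poland)",
    "United States of America",
    None,
])

# Take the text inside a trailing pair of parentheses where present,
# otherwise fall back to the original value
in_parentheses = birth_country.str.extract(r"\(([^)]+)\)\s*$")[0]
modern_country = in_parentheses.fillna(birth_country)

print(modern_country)
```

Whether to count by the historical or the present-day name is a judgment call, and it changes the rankings for countries such as Germany and Poland.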

# Country winners, 1901 to 1959
first_half.birth_country.value_counts().head(10).plot(kind = 'bar', figsize = (15,5))
plt.xlabel("Country", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Birth Country, 1901 to 1959", y=1.02)
plt.show()


# Country winners, 1960 to 2016
last_half.birth_country.value_counts().head(10).plot(kind = 'bar', figsize = (15,5))
plt.xlabel("Country", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Birth Country, 1960 to 2016", y=1.02)
plt.show();

Splitting the years in two and comparing the first 59 years of the Nobel Prizes (1901 to 1959) with the later years (1960 to 2016), we see that European countries were far better represented relative to the USA, in terms of the number of laureates, in the first period than in more recent times. So when did the USA actually start to dominate the Nobel Prize charts?

# Calculating the proportion of the USA born winners per decade
nobel['usa_born_winner'] = nobel['birth_country']=="United States of America"                   # add column (boolean) for USA BORN WINNER
nobel['decade'] = (np.floor(nobel['year'] / 10) * 10).astype(int)                               # add column for decade which award was given
prop_usa_winners = nobel.groupby('decade', as_index=False)['usa_born_winner'].mean()
prop_usa_winners_years = nobel.groupby('year', as_index=False)['usa_born_winner'].mean()

# Setting the plotting theme
sns.set()

# setting the size of all plots.
plt.rcParams['figure.figsize'] = [11, 7]

# Plotting USA born winners
ax = sns.lineplot(x=prop_usa_winners['decade'], y=prop_usa_winners['usa_born_winner'])
ax.set_title("Proportion of USA Born Winners by Decade")
ax.set_xlabel("Decade")
ax.set_ylabel("Proportion of USA Born Winner")

# Adding %-formatting to the y-axis
from matplotlib.ticker import PercentFormatter
ax.yaxis.set_major_formatter(PercentFormatter(1.0));

# Plotting USA born winners yearly
ax = sns.lineplot(x=prop_usa_winners_years['year'], y=prop_usa_winners_years['usa_born_winner'])
ax.set_title("Proportion of USA Born Winners by Year")
ax.set_xlabel("Year")
ax.set_ylabel("Proportion of USA Born Winner");

We clearly identify an increasing trend here. The proportion of USA-born winners grows steadily at the beginning of the 20th century, between 1900 and 1920. It then increases rapidly until about 1940. From there we see steady growth in the proportion of USA-born winners. In the 2000s it peaked, with over 40% of winners born in the USA.
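
The per-decade proportions rely on two small tricks: flooring the year to its decade, and taking the mean of a boolean column, which equals the fraction of True values. The sketch below illustrates both on a hypothetical five-row DataFrame; the integer-division form year // 10 * 10 is an alternative to the np.floor expression used above, and the snippet checks that the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
toy = pd.DataFrame({
    "year": [1901, 1905, 1912, 1918, 1919],
    "birth_country": [
        "Germany",
        "United States of America",
        "United States of America",
        "United States of America",
        "France",
    ],
})

# Boolean flag: True where the laureate was born in the USA
toy["usa_born_winner"] = toy["birth_country"] == "United States of America"

# Integer division floors the year to the start of its decade
toy["decade"] = toy["year"] // 10 * 10

# Same result as the np.floor version used in the analysis
assert (toy["decade"] == (np.floor(toy["year"] / 10) * 10).astype(int)).all()

# The mean of a boolean column is the proportion of True values per group
prop = toy.groupby("decade", as_index=False)["usa_born_winner"].mean()
print(prop)
```

Here the 1900s group has one USA-born winner out of two (0.5) and the 1910s group has two out of three.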

Analysis into Sex Characteristic¶

As we have established in the previous section, the USA became the dominant winner of the Nobel Prize in the 1930s and has kept the leading position ever since. An interesting path to follow in the analysis is to explore the distribution of sex amongst the laureates from 1901 to 2016.

# Display the number of prizes won by male and female recipients
nobel['sex'].value_counts()

Male      836
Female     49
Name: sex, dtype: int64

We see that the vast majority (over 90%) of the winners are male. A mere 5.5% (49 individuals out of 885) are female. We shall continue to explore this imbalance, and investigate if it is better or worse within specific prize categories like physics, medicine, literature, etc.
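
The shares can be checked directly from the two counts printed above. This is a small arithmetic sketch; the variable names are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above
sex_counts = pd.Series({"Male": 836, "Female": 49})

# Percentage of each sex among the 885 laureates with a recorded sex
sex_shares = (sex_counts / sex_counts.sum() * 100).round(2)

print(sex_counts.sum())  # 885
print(sex_shares)        # Male 94.46, Female 5.54
```

Note that the denominator is 885 rather than 911, because sex is only recorded for individual laureates and not for organisations.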

# Calculating the proportion of female laureates per decade
nobel['female_winner'] = nobel['sex']=='Female'                                                          # add new column for if winner was female
prop_female_winners = nobel.groupby(['decade','category'], as_index=False)['female_winner'].mean()       #new variable for proportion of female winner per decade

# Plotting female winners with % winners on the y-axis
ax = sns.lineplot(x='decade', y='female_winner', hue='category', data=prop_female_winners)
ax.yaxis.set_major_formatter(PercentFormatter(1.0));

This line plot shows some interesting trends and patterns. It appears that most awards to women, particularly in the earlier years, were for Literature; this is particularly noticeable between the 1920s and 1940s. We also see literature pick up again for women from the 1980s. Overall the imbalance is pretty large, with physics, economics, and chemistry having the largest imbalance. In the most recent years there is also a large disparity between the categories in the proportion of awards going to women. Medicine has a somewhat positive trend, and since the 1990s the literature prize is also more balanced. The big outlier is the peace prize during the 2010s, but keep in mind that this just covers the years 2010 to 2016.

The exact count of female winners in each category is given below.

# Number of Female Winners in each Category
female_winners = nobel[["year", "category","full_name"]][nobel.female_winner == True]
female_winners.category.value_counts()

Peace         16
Literature    14
Medicine      12
Chemistry      4
Physics        2
Economics      1
Name: category, dtype: int64
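
Raw counts hide the fact that the categories differ in size. As a rough sketch, the female counts above can be divided by the per-category totals from the univariate section. One caveat, stated as an assumption: the totals include organisations as well as individuals, so the Peace figure understates the share among individual laureates.

```python
import pandas as pd

# Prizes per category, copied from the category counts earlier in the analysis
total = pd.Series({"Medicine": 211, "Physics": 204, "Chemistry": 175,
                   "Peace": 130, "Literature": 113, "Economics": 78})

# Female winners per category, copied from the output above
female = pd.Series({"Peace": 16, "Literature": 14, "Medicine": 12,
                    "Chemistry": 4, "Physics": 2, "Economics": 1})

# Division aligns on the category labels, so the differing order does not matter
female_share = (female / total * 100).sort_values(ascending=False).round(2)
print(female_share)
```

By this measure Literature (about 12.4%) and Peace (about 12.3%) are nearly tied at the top, while Physics is lowest at roughly 1%.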

Laureates with Multiple Awards¶

For most scientists/writers/activists a Nobel Prize would be the crowning achievement of a long career. But for some, one is just not enough, and a few have received it more than once. Who are these lucky few?

# Selecting the laureates that have received 2 or more prizes.
nobel.groupby('full_name').filter(lambda group: len(group) >= 2)

# People who received more than one Nobel
repeated_awards = nobel['full_name'].value_counts()
repeated_awards[repeated_awards>=2]

Comité international de la Croix Rouge (International Committee of the Red Cross)    3
Office of the United Nations High Commissioner for Refugees (UNHCR)                  2
John Bardeen                                                                         2
Linus Carl Pauling                                                                   2
Frederick Sanger                                                                     2
Marie Curie, née Sklodowska                                                          2
Name: full_name, dtype: int64

The list of repeat winners contains some illustrious names. We see that 4 distinct individuals have won the award twice. Most notably, Marie Curie achieved this unprecedented feat, winning in physics in 1903 for her work on radioactivity and in chemistry in 1911 for the discovery of radium and polonium. John Bardeen got it twice in physics for transistors and superconductivity, Frederick Sanger got it twice in chemistry, and Linus Carl Pauling got it first in chemistry and later in peace for his work in promoting nuclear disarmament.

We also learn that organizations can receive the prize too: the Red Cross has received it three times and the UNHCR twice.
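
Counting by full_name works here, but it would miss a repeat winner whose name is spelled differently in two rows. A hedged alternative sketch is to count by laureate_id instead, since the data set describes it as a unique ID per winner. The rows below are a small subset with years, IDs, and names copied from the outputs in this analysis; the variable names are hypothetical.

```python
import pandas as pd

red_cross = ("Comité international de la Croix Rouge "
             "(International Committee of the Red Cross)")

# Small subset of prizes: (year, laureate_id, full_name)
rows = pd.DataFrame({
    "year": [1901, 1903, 1911, 1917, 1944, 1954, 1962, 1963],
    "laureate_id": [1, 6, 6, 482, 482, 217, 217, 482],
    "full_name": [
        "Wilhelm Conrad Röntgen",
        "Marie Curie, née Sklodowska",
        "Marie Curie, née Sklodowska",
        red_cross,
        red_cross,
        "Linus Carl Pauling",
        "Linus Carl Pauling",
        red_cross,
    ],
})

# Number of prizes per laureate_id, keeping only those with two or more
prizes_per_laureate = rows.groupby("laureate_id")["year"].count()
repeat_ids = prizes_per_laureate[prizes_per_laureate >= 2]
print(repeat_ids)
```

In this subset the IDs 6, 217, and 482 come back with 2, 2, and 3 prizes, while the single-prize ID 1 is dropped.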

Age of Laureates¶

Now we turn our attention to age, to investigate how old laureates generally are when they receive the prize.

# Converting birth_date from String to datetime
nobel['birth_date'] = pd.to_datetime(nobel['birth_date'])

# Calculating the age of Nobel Prize winners
nobel['age'] = nobel['year'] - nobel['birth_date'].dt.year

# Plotting the age of Nobel Prize winners
ax = sns.lmplot(x='year', y='age', data=nobel, lowess=True, aspect=2, line_kws={'color' : 'black'});

From the plot, we see that people used to be around 55 when they received the prize, but nowadays the average is closer to 65. These values are indicated by the black line running through the scatter of points (a LOWESS smoother). But there is a large spread in the laureates' ages, and while most are 50+, some are very young.

We also see that the density of points is much higher nowadays than in the early 1900s -- nowadays many more of the prizes are shared, and so there are many more winners.

We also see that there was a disruption in awarded prizes around the Second World War (1939 - 1945).
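
The age column plotted above is the prize year minus the birth year, so it is the age the laureate reaches during the prize year rather than the exact age on the day of the award. The sketch below repeats the calculation on four rows whose years, names, and birth dates are copied from the outputs in this analysis; it also shows that an organisation, which has no birth date, ends up with a missing age.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [1901, 1901, 1901, 1917],
    "full_name": [
        "Jacobus Henricus van 't Hoff",
        "Jean Henry Dunant",
        "Wilhelm Conrad Röntgen",
        "Comité international de la Croix Rouge (International Committee of the Red Cross)",
    ],
    # The organisation has no birth date, which becomes NaT
    "birth_date": pd.to_datetime(["1852-08-30", "1828-05-08", "1845-03-27", None]),
})

# Prize year minus birth year; NaT propagates to a missing age
toy["age"] = toy["year"] - toy["birth_date"].dt.year
print(toy[["full_name", "year", "age"]])
```

The three individuals come out at 49, 73, and 56, and the missing age for the organisation means it contributes no point to the plots.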

Next we shall explore the age trends within different prize categories.

# Same plot as above, but separate plots for each type of Nobel Prize
sns.lmplot(x='year', y='age', data=nobel, row='category' ,lowess=True, aspect=2, line_kws={'color' : 'black'});

Across the rows of plots, we see that winners of the chemistry, medicine, and physics prizes have all gotten older over time. The trend is strongest for physics: the average age used to be below 50, and now it's almost 70. Literature and economics are more stable. We also see that economics is a newer category. But peace shows an opposite trend where winners are getting younger! In the peace category we also see a winner around 2010 that seems exceptionally young.

The first prize in economic sciences was awarded to Ragnar Frisch and Jan Tinbergen in 1969, which is why the data for that category only starts from that year.

# The oldest winner of a Nobel Prize as of 2016
display(nobel.nlargest(1, 'age'))

# The youngest winner of a Nobel Prize as of 2016
display(nobel.nsmallest(1, 'age'))

Output of nobel.head(n=6) from the Cleaning and Manipulating section:

	year	category	prize	motivation	prize_share	laureate_id	laureate_type	full_name	birth_date	birth_city	birth_country	sex	organization_name	organization_city	organization_country	death_date	death_city	death_country
0	1901	Chemistry	The Nobel Prize in Chemistry 1901	"in recognition of the extraordinary services ...	1/1	160	Individual	Jacobus Henricus van 't Hoff	1852-08-30	Rotterdam	Netherlands	Male	Berlin University	Berlin	Germany	1911-03-01	Berlin	Germany
1	1901	Literature	The Nobel Prize in Literature 1901	"in special recognition of his poetic composit...	1/1	569	Individual	Sully Prudhomme	1839-03-16	Paris	France	Male	NaN	NaN	NaN	1907-09-07	Châtenay	France
2	1901	Medicine	The Nobel Prize in Physiology or Medicine 1901	"for his work on serum therapy, especially its...	1/1	293	Individual	Emil Adolf von Behring	1854-03-15	Hansdorf (Lawice)	Prussia (Poland)	Male	Marburg University	Marburg	Germany	1917-03-31	Marburg	Germany
3	1901	Peace	The Nobel Peace Prize 1901	NaN	1/2	462	Individual	Jean Henry Dunant	1828-05-08	Geneva	Switzerland	Male	NaN	NaN	NaN	1910-10-30	Heiden	Switzerland
4	1901	Peace	The Nobel Peace Prize 1901	NaN	1/2	463	Individual	Frédéric Passy	1822-05-20	Paris	France	Male	NaN	NaN	NaN	1912-06-12	Paris	France
5	1901	Physics	The Nobel Prize in Physics 1901	"in recognition of the extraordinary services ...	1/1	1	Individual	Wilhelm Conrad Röntgen	1845-03-27	Lennep (Remscheid)	Prussia (Germany)	Male	Munich University	Munich	Germany	1923-02-10	Munich	Germany

Output of nobel.describe().transpose() from the Univariate Analysis section:

	count	mean	std	min	25%	50%	75%	max
year	911.0	1969.201976	32.837978	1901.0	1946.0	1975.0	1997.0	2016.0
laureate_id	911.0	462.515917	270.236159	1.0	228.5	457.0	698.5	937.0

Output of nobel.groupby('full_name').filter(lambda group: len(group) >= 2) from the Laureates with Multiple Awards section:

	year	category	prize	motivation	prize_share	laureate_id	laureate_type	full_name	birth_date	birth_city	...	sex	organization_name	organization_city	organization_country	death_date	death_city	death_country	usa_born_winner	decade	female_winner
19	1903	Physics	The Nobel Prize in Physics 1903	"in recognition of the extraordinary services ...	1/4	6	Individual	Marie Curie, née Sklodowska	1867-11-07	Warsaw	...	Female	NaN	NaN	NaN	1934-07-04	Sallanches	France	False	1900	True
62	1911	Chemistry	The Nobel Prize in Chemistry 1911	"in recognition of her services to the advance...	1/1	6	Individual	Marie Curie, née Sklodowska	1867-11-07	Warsaw	...	Female	Sorbonne University	Paris	France	1934-07-04	Sallanches	France	False	1910	True
89	1917	Peace	The Nobel Peace Prize 1917	NaN	1/1	482	Organization	Comité international de la Croix Rouge (Intern...	NaT	NaN	...	NaN	NaN	NaN	NaN	NaT	NaN	NaN	False	1910	False
215	1944	Peace	The Nobel Peace Prize 1944	NaN	1/1	482	Organization	Comité international de la Croix Rouge (Intern...	NaT	NaN	...	NaN	NaN	NaN	NaN	NaT	NaN	NaN	False	1940	False
278	1954	Chemistry	The Nobel Prize in Chemistry 1954	"for his research into the nature of the chemi...	1/1	217	Individual	Linus Carl Pauling	1901-02-28	Portland, OR	...	Male	California Institute of Technology (Caltech)	Pasadena, CA	United States of America	1994-08-19	Big Sur, CA	United States of America	True	1950	False
283	1954	Peace	The Nobel Peace Prize 1954	NaN	1/1	515	Organization	Office of the United Nations High Commissioner...	NaT	NaN	...	NaN	NaN	NaN	NaN	NaT	NaN	NaN	False	1950	False
298	1956	Physics	The Nobel Prize in Physics 1956	"for their researches on semiconductors and th...	1/3	66	Individual	John Bardeen	1908-05-23	Madison, WI	...	Male	University of Illinois	Urbana, IL	United States of America	1991-01-30	Boston, MA	United States of America	True	1950	False
306	1958	Chemistry	The Nobel Prize in Chemistry 1958	"for his work on the structure of proteins, es...	1/1	222	Individual	Frederick Sanger	1918-08-13	Rendcombe	...	Male	University of Cambridge	Cambridge	United Kingdom	2013-11-19	Cambridge	United Kingdom	False	1950	False
340	1962	Peace	The Nobel Peace Prize 1962	NaN	1/1	217	Individual	Linus Carl Pauling	1901-02-28	Portland, OR	...	Male	California Institute of Technology (Caltech)	Pasadena, CA	United States of America	1994-08-19	Big Sur, CA	United States of America	True	1960	False
348	1963	Peace	The Nobel Peace Prize 1963	NaN	1/2	482	Organization	Comité international de la Croix Rouge (Intern...	NaT	NaN	...	NaN	NaN	NaN	NaN	NaT	NaN	NaN	False	1960	False
424	1972	Physics	The Nobel Prize in Physics 1972	"for their jointly developed theory of superco...	1/3	66	Individual	John Bardeen	1908-05-23	Madison, WI	...	Male	University of Illinois	Urbana, IL	United States of America	1991-01-30	Boston, MA	United States of America	True	1970	False
505	1980	Chemistry	The Nobel Prize in Chemistry 1980	"for their contributions concerning the determ...	1/4	222	Individual	Frederick Sanger	1918-08-13	Rendcombe	...	Male	MRC Laboratory of Molecular Biology	Cambridge	United Kingdom	2013-11-19	Cambridge	United Kingdom	False	1980	False
523	1981	Peace	The Nobel Peace Prize 1981	NaN	1/1	515	Organization	Office of the United Nations High Commissioner...	NaT	NaN	...	NaN	NaN	NaN	NaN	NaT	NaN	NaN	False	1980	False